Version: V11

Understanding Transcriptions and Translations Generation via VIDIZMO Indexer

The VIDIZMO Indexer enables you to generate transcriptions and translations for your audio and video files. The Indexer automatically detects the languages spoken in your content and generates transcriptions accordingly. If the original transcription is in a language other than English, you can configure the application to generate English translations as well.

The Indexer also supports multilingual transcription, enabling it to process content in multiple languages simultaneously. If your content features various languages spoken at the same time, the Indexer will identify each language and generate the corresponding transcriptions.

Concept

When you choose an audio or video file for processing with 'Auto-Detect' selected, the AI model set up in the VIDIZMO Indexer app detects the language spoken in the file. Once the language is detected, the app generates the transcript in that language for the entire playback duration of the audio or video file.

The accuracy of transcriptions for a language depends on that language's word error rate (WER). A lower WER indicates that the transcriptions generated for that language are more accurate. Your application can also affect the accuracy of the transcriptions generated.

Here is a list of languages supported by the VIDIZMO Indexer, along with their respective WER when the Large-V1 model is used. The performance of the Indexer and the accuracy of its Insights vary depending on the model size. The following sections of this article provide more information about the models. The languages in this list are arranged from most accurate (lowest WER) to least accurate (highest WER).

Serial No.	Language	WER
1	Spanish	3.5
2	Italian	4.2
3	English	4.5
4	Portuguese	4.8
5	German	5.5
6	Japanese	6.4
7	Russian	6.4
8	Polish	7.2
9	French	7.7
10	Catalan	8.0
11	Dutch	8.3
12	Indonesian	8.5
13	Turkmen	9.4
14	Turkish	9.4
15	Malay	10.2
16	Ukrainian	10.3
17	Swedish	10.5
18	Vietnamese	10.7
19	Norwegian	11.4
20	Finnish	12.2
21	Thai	13.2
22	Korean	15.2
23	Romanian	15.4
24	Slovak	15.7
25	Tagalog	15.8
26	Croatian	16.7
27	Danish	16.8
28	Czech	17.4
29	Arabic	18.1
30	Bulgarian	18.4
31	Greek	18.7
32	Galician	19.0
33	Chinese	19.6
34	Macedonian	20.6
35	Tamil	20.6
36	Bosnian	20.7
37	Hungarian	21.0
38	Urdu	25.0
39	Estonian	25.5
40	Hindi	26.9
41	Slovenian	27.8
42	Latvian	28.3
43	Azerbaijani	28.7
44	Serbian	29.2
45	Hebrew	30.2
46	Lithuanian	35.2
47	Persian	36.1
48	Welsh	36.6
49	Afrikaans	42.6
50	Icelandic	43.0
51	Marathi	43.7
52	Kazakh	43.8
53	Māori	45.7
54	Swahili	47.9
55	Nepali	52.2
56	Armenian	53.7
57	Belarusian	56.6
58	Kannada	69.8
59	Tajik	74.5
60	Occitan	75.9
61	Lingala	76.8
62	Maltese	80.5
63	Luxembourgish	86.5
64	Hausa	87.0
65	Javanese	87.0
66	Pashto	92.7
67	Uzbek	93.3
68	Khmer	96.0
69	Georgian	100.5
70	Telugu	100.6
71	Malayalam	101.4
72	Lao	101.6
73	Punjabi	102.8
74	Somali	103.5
75	Gujarati	103.9
76	Bengali	104.9
77	Assamese	105.6
78	Mongolian	106.2
79	Yoruba	111.7
80	Amharic	129.3
81	Shona	130.0
82	Sindhi	177.9

In addition to the languages above, the VIDIZMO Indexer can generate insights for some additional rare languages. However, because these languages are scarce and their training data is limited, their estimated Word Error Rate (WER) is high, and the results may be unusual or insufficient.

Bashkir
Tibetan
Breton
Basque
Faroese
Hawaiian
Haitian Creole
Latin
Malagasy
Burmese
Norwegian Nynorsk
Sanskrit
Sinhalese
Albanian
Sundanese
Tatar
Yiddish
Cantonese

For video content, the VIDIZMO Indexer only handles videos that have audio. If speech is detected in that audio, the app generates transcripts accordingly. When the input is a video, the VIDIZMO Indexer separates the audio component from the video and then performs the transcription process solely on the audio portion.

When the VIDIZMO Indexer has Transcriptions in its Insights, you can't use the AWS Indexer App or Azure Video Analyzer ARM for transcriptions. Additionally, you can't activate the VIDIZMO Indexer if either the AWS Indexer App or Azure Video Analyzer ARM is enabled with the option to generate Transcriptions.
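The exclusivity rule above can be pictured as a simple configuration check. The following is only a sketch under stated assumptions: the function name `transcription_conflicts` and the dictionary layout are hypothetical and are not part of any VIDIZMO API. The only thing taken from this article is the rule itself, namely that the VIDIZMO Indexer's Transcriptions cannot coexist with transcription in the AWS Indexer App or Azure Video Analyzer ARM.

```python
# Hypothetical illustration of the rule described above; not a VIDIZMO API.
VIDIZMO_INDEXER = "VIDIZMO Indexer"
OTHER_TRANSCRIPTION_APPS = ("AWS Indexer App", "Azure Video Analyzer ARM")


def transcription_conflicts(transcription_enabled: dict[str, bool]) -> list[str]:
    """Return the apps whose transcription setting conflicts with the VIDIZMO Indexer.

    `transcription_enabled` maps an app name to whether that app is set to
    generate transcriptions. Apps missing from the mapping count as disabled.
    """
    if not transcription_enabled.get(VIDIZMO_INDEXER, False):
        # Without VIDIZMO Indexer transcriptions there is nothing to conflict with.
        return []
    return [
        app
        for app in OTHER_TRANSCRIPTION_APPS
        if transcription_enabled.get(app, False)
    ]


settings = {VIDIZMO_INDEXER: True, "AWS Indexer App": True}
print(transcription_conflicts(settings))
```

An empty result means the combination is allowed under the rule as described here. Any non-empty result names the apps whose transcription option would need to be turned off first.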

Audio Translation Concept

If the user has also opted for Audio Translation, the VIDIZMO Indexer app will translate the speech in the audio or video file from the detected language. These translations appear in the transcription pane of your file's playback page. If both Transcription and Audio Translation are selected, the VIDIZMO Indexer generates both Insights, and the user can choose which one to view from the transcription pane. As of now, the VIDIZMO Indexer can only translate into English.

If an audio or video file doesn't have any transcriptions, processing it with the translation Insight generates the English translations, regardless of the spoken language in it. Portal content that already has transcriptions, either from another indexing application or the user uploading a .vtt file, can still be processed for the English translations.

The features offered by the VIDIZMO Indexer app utilize AI processing as a consumption metric for your VIDIZMO Account. To learn how you can view consumption reports, refer to Consumption Reports for SaaS Deployment Overview.

Document and Image Translation

In addition to translating speech-driven transcriptions, the VIDIZMO Indexer can translate the text in the documents and images present in your Portal. When Document Translation or Image Translation is enabled in the VIDIZMO Indexer's Insights, supported documents and images are automatically translated using the same processing modes (automatic processing when new items are added, or on-demand via upload and the Process modal). Translation works directly on the document or image text; it doesn't require audio or an existing transcript. Translations count toward your AI processing usage.

Document and image translation works differently from audio translation: the VIDIZMO Indexer first uses OCR (required) to extract the visible text, then detects the source language, and finally translates the extracted text. No audio or existing transcript is needed.

Supported languages for document and image translation:

English
French
German
Hindi
Italian
Portuguese
Spanish
Thai

For step-by-step instructions, see the Translating Documents and Images using VIDIZMO Indexer section in How to Generate Transcriptions and Translations using VIDIZMO Indexer.

Processing

To generate accurate transcriptions or translations for your audio and video files, the VIDIZMO Indexer aims to minimize substitutions, insertions, and deletions, which reduces the overall Word Error Rate (WER). The lower the WER of a transcription, the more accurate it is.
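As a rough illustration of how these three error types combine into a single WER figure, the sketch below computes WER as (substitutions + deletions + insertions) divided by the number of words in the reference transcript, using a word-level edit distance. This is not the Indexer's own evaluation code: the function name `word_error_rate` and the sample sentences are hypothetical, and the sketch assumes WER is expressed as a percentage.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER (as a percentage) of `hypothesis` measured against `reference`."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = minimum edits to turn the reference words seen so far
    # into the first j hypothesis words (classic Levenshtein row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words yields an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word present in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 1))
```

Because insertions count as errors and the total is divided by the reference length only, a transcript with many extra words can score above 100, which is how a WER greater than 100 is possible.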

In the first step of the indexing process, the raw audio inputs from your audio or video files are converted into a log-Mel spectrogram using a feature extractor. The system then encodes these spectrogram features into a sequence of encoder hidden states. An internal language model (LM) then decodes the text tokens autoregressively, predicting each token from the encoder hidden states and the tokens generated before it.
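To make the feature-extraction step more concrete, here is a minimal pure-Python sketch of turning raw samples into a log-Mel spectrogram: the audio is split into overlapping frames, each frame's power spectrum is computed, the spectrum is pooled into mel-spaced triangular bands, and the logarithm of each band's energy is taken. The frame size, hop length, band count, and every function name are illustrative assumptions, not the Indexer's actual settings. Real feature extractors use optimized FFT libraries rather than the naive DFT shown here.

```python
import cmath
import math


def hz_to_mel(hz: float) -> float:
    """Convert a frequency in Hz to the mel scale (common 2595 * log10 form)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame: list[float]) -> list[float]:
    """Power spectrum of one Hann-windowed frame, computed with a naive DFT."""
    n = len(frame)
    windowed = [
        s * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
        for i, s in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_filterbank(n_mels: int, n_fft: int, sample_rate: int) -> list[list[float]]:
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    top_mel = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        low, center, high = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - low) / (center - low)
            falling = (high - f) / (high - center)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def log_mel_spectrogram(samples, sample_rate, n_fft=64, hop=32, n_mels=8):
    """Return one list of n_mels log10 band energies per frame."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        power = power_spectrum(samples[start:start + n_fft])
        frames.append([
            # Floor the energy so log10 never receives zero.
            math.log10(max(sum(w * p for w, p in zip(row, power)), 1e-10))
            for row in bank
        ])
    return frames


# Usage: a synthetic 1 kHz tone sampled at 8 kHz stands in for real audio.
rate = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * i / rate) for i in range(256)]
features = log_mel_spectrogram(tone, rate)
print(len(features), "frames x", len(features[0]), "mel bands")
```

The resulting grid of frames by mel bands is the kind of input the encoder stage would consume. The values here are deliberately tiny so the example runs quickly without external libraries.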

You can also configure the application to automatically process any audio or video file uploaded or added to your Portal via other means. Learn how to do this configuration and more by visiting: Configuring VIDIZMO Indexer for Transcription and Translation

Conclusion

The VIDIZMO Indexer aims to generate reliable, accurate transcriptions, translations, and AI insights for all your audio and video files. You can configure the application to automatically generate insights for files that you add to the Portal using the same configured settings. If you are unsatisfied with the results, you can also manually create and regenerate transcriptions with different parameters from your Portal.

For a guide to a more practical or hands-on approach to the transcription generation process, visit How to Generate Transcriptions and Translations using VIDIZMO Indexer
